We are thankful to the three discussants for their
helpful and stimulating comments on our work. We
value especially the many suggestions for extensions 
of the methodology given by all three discussants. 

One such suggested extension, with which we were 
very pleasantly surprised, was Meinshausen's dis- 
cussion of the defining hypotheses. The defining hy- 
potheses of a closed testing result, as we presented 
them, describe the result of the closed testing proce- 
dure as a union of intersections of hypotheses. Mein- 
shausen's suggestion is to rewrite the same result as 
an intersection of unions, which can always be done 
with some basic algebra. We like to call the resulting 
collection the shortlist, because it shortlists the candidate
combinations of false hypotheses: with $1-\alpha$
confidence, at least one of the shortlist sets is a subset
of the actual set of false hypotheses. Rewriting the
result in this way gives a surprisingly complemen- 
tary perspective on the results of the procedure, that 
is intuitive and can be very helpful for interpreta- 
tion of the test procedure's results, as demonstrated 
by Meinshausen. We have added the possibility to 
calculate the shortlist to the cherry package, with 
thanks. 
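The rewriting step can be sketched in a few lines of Python. The sketch assumes that each defining rejection is given as a set of hypothesis indices, read as "at least one of these hypotheses is false", so that the shortlist consists of the minimal index sets that intersect every defining set. The function name and the toy input are hypothetical, the search is brute force, and this is not the implementation used in the cherry package.

```python
from itertools import combinations


def shortlist(defining):
    """Return the minimal index sets that intersect every defining set.

    `defining` is an iterable of index sets; each one is read as
    "at least one hypothesis in this set is false" (an assumption of
    this sketch). Brute force, so only suitable for small examples.
    """
    defining = [frozenset(d) for d in defining]
    universe = sorted(set().union(*defining))
    found = []
    # Candidates are visited by increasing size, so any candidate that
    # contains an earlier solution cannot be minimal and is skipped.
    for size in range(1, len(universe) + 1):
        for cand in combinations(universe, size):
            s = frozenset(cand)
            if any(f <= s for f in found):
                continue
            if all(s & d for d in defining):
                found.append(s)
    return found


# Toy input: "H1 or H2 is false" and "H2 or H3 is false".
example = shortlist([{1, 2}, {2, 3}])
```

For the toy input, the two statements together imply that either hypothesis 2 is false, or hypotheses 1 and 3 are both false, which is what the two returned sets express.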

An interesting variant to our procedure was sug- 
gested by Heller. In this variant not just the ele- 
mentary hypotheses are candidates for validation, 
but also (sub)families of hypotheses are of interest 
by themselves. Such subfamilies of hypotheses can 
be represented by intersection hypotheses in the clo- 
sure. An example of this can be a genomic data anal- 
ysis in which both single genes and gene sets are of 
interest. It is relatively easy to extend the methodology
to allow confidence statements on $\#(\mathcal{R} \cap \mathcal{T})$,
where $\mathcal{R}$ is a collection of index sets representing
intersection hypotheses that are chosen as candidates
for validation. Confidence bounds $t_\alpha(\mathcal{R})$ can be
derived using reasoning similar to that used to find the
familiar $t_\alpha(R)$.

The link between our proposed approach and the 
partial conjunction approach of Heller is a strong 
one, which we acknowledge and to which we should 
perhaps have pointed more explicitly. In our notation,
the partial conjunction hypothesis $H^{u/n}$ is
given by

$$H^{u/n} = \bigcup_{I \in \mathcal{C}_u} H_I,$$

where $\mathcal{C}_u = \{C \in \mathcal{C}: |C| = u\}$. This is exactly the
union of hypotheses that has to be rejected in the
closed testing procedure to be able to conclude that
$t_\alpha(R) < u$ for $R = \{1, \ldots, n\}$. Our method can be
seen as extending Heller's work by allowing
other choices of R, but diverging from it where the 
issue of multiple testing of partial conjunction hy- 
potheses is concerned, a problem which Benjamini 
and Heller (2008) addressed in an FDR context. In 
her discussion, Heller introduces the partial conjunction
hypothesis $H^{u/|R|}$, which we would prefer to denote
$H^{u/R}$ because it depends on the set $R$, not just
on its cardinality, and formally define as

$$H^{u/R} = \bigcup_{I \subseteq R,\ |I| = u} H_I,$$

where the union runs over the index sets $\{I \subseteq R: |I| = u\}$.
This is indeed a central
hypothesis to the approach we presented, which can
alternatively be described as simultaneously testing
$H^{u/R}$ for all sets $R \subseteq \{1, \ldots, n\}$ and for all
$1 \le u \le |R|$. Thinking of the procedure in terms of such
partial conjunctions is a valuable perspective on our 
procedure. We also value Heller's concept of suf- 
ficient combining functions, which promises to be 
useful for finding new shortcuts and proving their 
validity. 
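For illustration, take $n = 3$, $R = \{1, 2, 3\}$ and $u = 2$. This small case does not appear in the discussion itself and is only meant as a sketch of how the definitions above unfold:

```latex
\mathcal{C}_2 = \bigl\{\{1,2\},\ \{1,3\},\ \{2,3\}\bigr\},
\qquad
H^{2/3} = H_{\{1,2\}} \cup H_{\{1,3\}} \cup H_{\{2,3\}} .
```

Rejecting $H^{2/3}$ in the closed testing procedure thus amounts to rejecting all three pairwise intersection hypotheses, and it yields $t_\alpha(R) < 2$: with $1-\alpha$ confidence, at most one of the three hypotheses is true.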

The computational problems associated with 
closed testing have been rightly stressed by both 
Meinshausen and Westfall, and we are aware of these 
problems. We cannot stress enough that shortcuts 
are crucial for the usability of the method we have 
proposed unless the number of tested hypotheses is
small. However, we think that the shortcuts we have 
described in our paper are only a beginning, and 
that many more shortcuts are possible. Of practical 
relevance to genomics research are especially short- 
cuts of the types discussed in Section 4.4 of our 
paper, in which a limited number of intersection 
hypotheses are tested with a non-consonant test, 
while the rest of the hypotheses can be tested using 
weighted Bonferroni-based combinations of these test 
results. Such shortcuts are relatively easy to design 
and they can be tailored to the specific needs of prac- 
tical testing problems. Other, more general short- 
cuts are likely to be found as well. 

Some of the issues raised by the discussants re- 
quire a somewhat more thorough discussion or give 
rise to some interesting elaborations of the theory. 
We will take the opportunity to go into a few sub- 
jects more deeply, elaborating on the issue of power, 
mentioned by Meinshausen and Westfall, on the issues
of restricted combinations and adjusted p-values,
both discussed by Westfall, and finally on the com- 
plicated practice of exploratory research, as com- 
mented on by Heller. 

1. POWER OF THE PROPOSED APPROACH 

Both Meinshausen and Westfall commented on 
the power of our proposed procedure, illustrating 
their points with example data and simulations. Pow- 
er is a crucial consideration not just in confirma- 
tory, but also in exploratory settings. It is difficult, 
however, to talk about the power of our method in 
a general way, because the closed testing procedure 
that underlies it is extremely versatile. The power 
properties of the procedure depend crucially on the 
power properties of the chosen local test. We will 
illustrate this by looking at the examples given by 
the two discussants in more detail. 

Westfall analyzes the famous Golub et al. (1999) 
microarray dataset using both Fisher combinations 
and Bonferroni as a local test, the latter leading 
to Holm's (1979) procedure. At a familywise error
level of 0.05, Holm finds 37 out of 7,129 elementary
hypotheses to be individually significant, whereas 
Fisher combinations do not find any significant ele- 
mentary hypotheses. Taking into account that Fisher 
combination tests are known to be anti-conservative 
in these data due to correlations among the test 
statistics, this comparison does indeed seem to come 
out clearly in favor of Bonferroni/Holm. This assessment
changes, however, if we try to make a statement
that is not of familywise error type, such as
counting how many false hypotheses are present 
among the 7,129 hypotheses. The procedure based 
on Bonferroni states at 95% confidence that the 37 
hypotheses found with familywise error control are 
false, but can say no more than that. The method 
based on Fisher combinations, on the other hand, 
although it could not confidently point to any indi- 
vidual hypothesis as false, finds with 95% confidence 
that no fewer than 1,828 false hypotheses are present 
among the 4,082 hypotheses with smallest p-values. 
This example illustrates that different local tests re- 
sult in procedures with completely different prop- 
erties. Procedures based on highly consonant tests, 
such as Bonferroni's, tend to have good power for
intersection hypotheses $H_I$ of low cardinality $|I|$ and,
consequently, are the method of choice for familywise
error statements. Procedures based on highly
non-consonant tests, such as Fisher combinations,
tend to have good power for intersection hypotheses
$H_I$ of high cardinality $|I|$ and, consequently,
typically give superior bounds $t_\alpha(R)$ for large sets $R$.
It is interesting to note that the Simes local test 
takes an intermediate course for this dataset, finding 
the same 37 hypotheses to be significant from a fam- 
ilywise error perspective, but finding 111 additional 
false null hypotheses in total among the 7,129 genes 
due to its additional non-consonant rejections. 
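The dependence on the local test can be seen in a toy example. The sketch below runs a brute-force closed testing procedure on four made-up p-values, with Bonferroni, Simes and Fisher combinations as local tests, and computes $t_\alpha(R)$ as the size of the largest subset of $R$ whose intersection hypothesis is not rejected. All function names and the p-values are hypothetical, the Fisher test assumes independent, strictly positive p-values, and the enumeration of all subsets is only feasible for very small families.

```python
from itertools import combinations
from math import exp, factorial, log


def bonferroni(ps, alpha):
    """Bonferroni local test of an intersection hypothesis."""
    return min(ps) <= alpha / len(ps)


def simes(ps, alpha):
    """Simes local test of an intersection hypothesis."""
    m = len(ps)
    return any(p <= k * alpha / m for k, p in enumerate(sorted(ps), start=1))


def fisher(ps, alpha):
    """Fisher combination local test (assumes independent p-values > 0).

    The statistic x = -2 * sum(log p) is chi-square with 2m degrees of
    freedom, whose survival function has the closed form used below.
    """
    m = len(ps)
    half = -sum(log(p) for p in ps)  # equals x / 2
    pval = exp(-half) * sum(half ** j / factorial(j) for j in range(m))
    return pval <= alpha


def closed_testing(pvals, alpha, local_test):
    """Index sets I for which H_I is rejected by closed testing."""
    n = len(pvals)
    subsets = [frozenset(c) for k in range(1, n + 1)
               for c in combinations(range(n), k)]
    local = {I: local_test([pvals[i] for i in I], alpha) for I in subsets}
    # H_I is rejected only if every superset is rejected by its local test.
    return {I for I in subsets if all(local[J] for J in subsets if I <= J)}


def t_alpha(R, rejected):
    """Upper confidence bound on the number of true hypotheses in R."""
    R = sorted(R)
    for k in range(len(R), 0, -1):
        if any(frozenset(c) not in rejected for c in combinations(R, k)):
            return k
    return 0


pvals = [0.06, 0.07, 0.08, 0.09]  # made-up p-values, none below 0.05
R = range(len(pvals))
bounds = {test.__name__: t_alpha(R, closed_testing(pvals, 0.05, test))
          for test in (bonferroni, simes, fisher)}
```

With these p-values no elementary hypothesis is rejected by any of the three procedures. Bonferroni and Simes give the trivial bound of 4 true hypotheses, whereas the Fisher-based procedure gives a bound of 1, that is, at least three false hypotheses among the four.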

Related remarks can be made about the simu- 
lation study performed by Meinshausen. His sim- 
ulated alternatives with a large number of hypothe- 
ses m have a very sparse but strong signal. This is 
a kind of signal that Fisher combinations are not 
very good at detecting, as is clearly illustrated by 
the simulation results. The observed low power in 
the simulation is more a feature of Fisher combi- 
nations as a local test than of the confidence set 
method as such. If a sparse but strong signal was 
expected, Fisher combinations should not have been 
chosen as the local test. If we redo the simulation 
with Simes local tests we get a completely different 
picture (Figure 1), with comparable power to Fisher 
combinations for low values of m, and only slightly 
lower power compared to Meinshausen (2006) for 
large values of m. 

Depending on the alternative that is to be picked 
up, and depending on the type of statements that 
are to be made, different choices of local tests may 
lead to procedures with good or bad power. Ob- 
viously, this gives much room for the development 
of powerful procedures tailored to specific research 
questions on specific types of data. 

Fig. 1. Simulation study as in Figure 1(b) in Meinshausen's
discussion, but with Simes' local tests added. Meinshausen's
method (circles), Fisher combinations (diamonds) and Simes
(pluses).

2. RESTRICTED COMBINATIONS 

Westfall raises the important issue of restricted 
combinations. Restricted combinations occur when, 
because of logical relationships between hypotheses, 
the collection T of true null hypotheses is a priori 
restricted, and some elements of the closure cannot 
be equal to the true set. In the example given by 
Westfall, the hypotheses are $H_1: \mu_1 = \mu_2$, $H_2: \mu_1 = \mu_3$
and $H_3: \mu_2 = \mu_3$. For these hypotheses, $T = \{1, 2\}$
is not possible, as simultaneous truth of $H_1$ and $H_2$
implies truth of $H_3$. Similarly, any other set $T$ of
cardinality 2 is excluded, and $T$ can only take the
values $\emptyset$, $\{1\}$, $\{2\}$, $\{3\}$ or $\{1, 2, 3\}$. We call those
sets that cannot be equal to T incongruent. There 
is an immense body of literature on multiple testing 
in the presence of restricted combinations, starting 
with the famous paper of Shaffer (1986), but we did 
not consider this issue in our paper. 

Westfall claims that the method we have proposed 
may be conservative if restricted combinations are 
present. This is true. However, a very simple exten- 
sion of the method can remove this conservative- 
ness in a general way, which is very similar to West- 
fall's treatment of the specific example. This exten- 
sion follows from the Sequential Rejection Princi- 
ple (Goeman and Solari, 2010). Applied to closed 
testing, this principle states that the local test of
each $H_I$, $I \in \mathcal{C}$, when it is its turn to be tested,
may assume that all hypotheses $H_J$, $J \supset I$, are false.
For an incongruent set $I$, falsehood of all such hypotheses
$H_J$ immediately implies that $H_I$ itself is
false. Therefore, even the test that always rejects is
a valid local test for the incongruent set $I$. Consequently,
in the presence of restricted combinations,
we may assume that the local test always rejects all 
incongruent sets. 
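For the three pairwise comparisons in Westfall's example, the incongruent sets can be found mechanically. The sketch below assumes that each hypothesis states the equality of two means, so that a set of hypotheses is congruent exactly when it is closed under the implied equalities. The names `PAIRS`, `implied`, `is_congruent` and `with_restrictions` are hypothetical, and the wrapper only illustrates the idea of always rejecting incongruent sets.

```python
from itertools import combinations

# Hypothesis k states that the two listed means are equal.
PAIRS = {1: (1, 2), 2: (1, 3), 3: (2, 3)}


def implied(T, pairs=PAIRS):
    """All hypotheses that must be true when those in T are true."""
    groups = sorted({g for pair in pairs.values() for g in pair})
    parent = {g: g for g in groups}

    def find(g):
        # Follow parent links to the representative of g's group.
        while parent[g] != g:
            g = parent[g]
        return g

    for k in T:
        a, b = pairs[k]
        parent[find(a)] = find(b)  # merge the two equal means
    return {k for k, (a, b) in pairs.items() if find(a) == find(b)}


def is_congruent(T, pairs=PAIRS):
    """True if T can be exactly the set of true hypotheses."""
    return implied(T, pairs) == set(T)


def with_restrictions(local_test, pairs=PAIRS):
    """Wrap a local test so that incongruent sets are always rejected."""
    def test(I, *args):
        return True if not is_congruent(I, pairs) else local_test(I, *args)
    return test


congruent = [c for k in range(len(PAIRS) + 1)
             for c in combinations(sorted(PAIRS), k) if is_congruent(c)]
```

The list `congruent` reproduces the five admissible values of the true set given in the text; all three sets of cardinality 2 are incongruent.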

The same conclusion may also be arrived at in an 
alternative way, by using the partitioning principle 
(Finner and Strassburger, 2002) rather than closed 
testing to make the confidence sets. For each $I \in \mathcal{C}$,
let the corresponding partitioning hypothesis be

$$J_I = H_I \setminus \bigcup_{J \supset I} H_J.$$

Suppose an $\alpha$-level test is available for every $J_I$,
$I \in \mathcal{C}$, and let $\mathcal{V} \subseteq \mathcal{C}$ be the index set of the
partitioning hypotheses rejected by their corresponding
test. Then, analogously to the closed testing based
procedure, by the partitioning principle the intersection
hypothesis $H_I$, $I \in \mathcal{C}$, is rejected whenever
$J \in \mathcal{V}$ for every $J \supseteq I$. From this, we can make the
set $\mathcal{X}$ of rejected intersection hypotheses and derive
the upper confidence limit $t_\alpha(R)$ as before. The
whole procedure is completely analogous to the one
presented in the paper, only the smaller partitioning
hypotheses $J_I$, $I \in \mathcal{C}$, take the role of the closure
hypotheses $H_I$ when finding the set $\mathcal{X}$. As $J_I \subseteq H_I$
for every $I$, any valid test of $H_I$ is also a valid test
for $J_I$, but $J_I = \emptyset$ for every incongruent $I$, so that
we can safely reject every incongruent $J_I$. Consequently,
again, we may assume that incongruent hypotheses
are always rejected by their local test.

If we extend our method in this way, we have 
a general solution for the problem of restricted com- 
binations. Using this extension, the conservative- 
ness noted in Westfall's example disappears, and the 
stronger statements he obtained are recovered. For 
concrete examples of families with restricted combi- 
nations, finding good shortcuts that take restricted 
combinations into account may, of course, still be 
a challenging problem. 

3. ADJUSTED p-VALUES 

The adjusted p-value is an important feature of
multiple testing procedures that conveys valuable
additional information over the simple decision to
reject or not to reject, as Westfall rightly points out.
We did not go into the issue of adjusted p-values in
the paper because we wanted to stress the analogy
with confidence intervals, which are typically calculated
for a fixed $\alpha$. It is possible, however, to find an
analogue of the adjusted p-value that conveys the same
type of additional information, or even more.

Fig. 2. Confidence probability mass function for R = {waist, forearm, calf, thigh} (left-hand side) and R = {waist, forearm,
height, thigh} (right-hand side). Cumulative percentages are given at the top of the figure.

By definition, the adjusted p-value of a hypothesis
is the smallest $\alpha$-level that allows rejection
of that hypothesis. By giving the threshold that
separates the $\alpha$-levels at which the hypothesis
would be rejected from those at which it would not,
the adjusted p-value shows what inferences
would have been obtained had some other value of $\alpha$
been chosen. Analogously, in the exploratory
setting, we may also vary the value of $\alpha$
and plot the upper confidence bound $t_\alpha(R)$ of the
number of true hypotheses $\tau(R)$ as a function of $\alpha$.
Just like the adjusted p-value, this shows the dependence
of our conclusions on the arbitrary choice of $\alpha$.

An intuitive way to visualize the dependence
of $t_\alpha(R)$ on $\alpha$ is through a plot analogous to a
confidence distribution (Singh, Xie and Strawderman,
2007). This can be obtained by interpreting the plot
of $t_\alpha(R)$ as a function of $\alpha$ as if it were the quantile
function of some discrete probability distribution,
and plotting the differential of that, that is, the
associated "probability mass function." For the sets
R = {waist, forearm, calf, thigh} and R = {waist,
forearm, height, thigh} in the example of the physical
data of Section 3 in the paper, this plot is given
in Figure 2. From this plot, we can read off the 95%
confidence limit $t_{0.95}(R)$ and the estimate $t_{1/2}(R)$
by finding the 95th percentile and the median,
respectively. We can also read off the familywise error
adjusted p-value as 1 minus the confidence distribution
at 0.
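The construction can be made concrete in the special case of a Bonferroni local test. The sketch below assumes that closed testing then reduces to Holm's procedure and that the upper bound equals the number of hypotheses in $R$ whose Holm-adjusted p-value exceeds $\alpha$. Under that assumption, the "probability mass function" follows directly from the sorted adjusted p-values. The function names and the p-values are hypothetical and do not reproduce the physical data example.

```python
def holm_adjusted(pvals):
    """Holm-adjusted p-values, returned in the original order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running = 0.0
    for rank, i in enumerate(order):
        # Step-down multiplier n, n-1, ..., 1 with enforced monotonicity.
        running = max(running, min(1.0, (n - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


def t_alpha(adjusted, R, alpha):
    """Upper bound on the number of true hypotheses in R (Holm case)."""
    return sum(1 for i in R if adjusted[i] > alpha)


def confidence_pmf(adjusted, R):
    """Mass assigned to 0, 1, ..., |R| true hypotheses in R.

    Reads alpha -> t_alpha(R) as a quantile function evaluated at
    1 - alpha, so that cdf[k] = 1 - (smallest alpha with t_alpha <= k).
    """
    a = sorted((adjusted[i] for i in R), reverse=True)
    cdf = [1.0 - x for x in a] + [1.0]
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]


pvals = [0.01, 0.02, 0.03, 0.5]  # made-up p-values
adj = holm_adjusted(pvals)
R = range(len(pvals))
pmf = confidence_pmf(adj, R)
bound_95 = t_alpha(adj, R, 0.05)
```

In this toy case the mass is approximately 0.50, 0.44, 0, 0.02 and 0.04 on the values 0 to 4, so the cumulative mass first reaches 95% at the value 3, which agrees with `bound_95`. One minus the mass at 0 equals the largest Holm-adjusted p-value in $R$.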

It is important to realize that a confidence distribution
is a random variable, not a probability distribution,
in the same way that an adjusted p-value
is a random variable, not a probability. The representation
as if it were a distribution should just be
seen as a convenient way to visualize the dependence
of $t_\alpha(R)$ on $\alpha$. It is the direct analogue of the
adjusted p-value for the method we have proposed.

4. THE PRACTICE OF EXPLORATORY 
RESEARCH 

The methods we have presented still require a quite 
formal and planned design in which hypotheses have 
been formulated before data analysis and tests for 
intersection hypotheses are chosen beforehand. Such 
a way of working is close enough to actual data 
analysis in many genomics experiments, but it is 
far too formal to capture the great variety and free- 
dom of true exploratory research. In actual practice, 
researchers often first perform a goodness-of-fit test 
on the same data before deciding what test to do. 
Researchers typically do additional unplanned hypothesis
tests to detect the presence of effects suggested
by plots of the same data. Also, researchers
sometimes perform additional tests as a consequence
of the nonsignificance of other tests, because they
are not satisfied with the nonsignificant result obtained.
It is impossible to capture the true complexity
of exploratory research in any formal method.

Heller suggests a two-stage approach that separates
the data used for exploratory analysis into two
parts. The first part is used as a pilot merely to decide
what an appropriate model would be, and how
intersection hypotheses should be tested, in a com- 
pletely free exploratory manner. The end result of 
this data analysis would be a list of hypotheses and 
a plan for the closed testing procedure to be used 
in the next exploratory phase, which is then for- 
mal enough to allow use of the methods we have 
proposed. This proposal is practical and elegantly 
simple, mimicking the data splitting between ex- 
ploratory and confirmatory research. It is also good 
that it stresses the need for a pilot experiment, which 
can be useful in many other respects as well. A prac- 
tical problem, however, may be that many crucial 
decisions, regarding, for example, which test can be 
expected to have most power or which model fits 
best, may require a quite large sample size, so that
relatively small pilot experiments may not be ad- 
equate. One can also object philosophically to the 
idea, saying that the same arguments that can be 
used to change the empirical cycle from a two-phase 
process into a three-phase one could again be used 
to add a new fourth initial phase to the cycle, be- 
cause the methods used in the pilot phase may be 
wrong or lacking in power. This way there would be 
no end to data splitting. 

In the end, we think exploratory research in its 
most general form is too fluid to be captured in formal
methodology. We feel, however, that we have
demonstrated that more things are possible in the
exploratory context than was commonly thought, 
and we hope we have stimulated the discussion on 
multiple testing in exploratory research. In our turn, 
we have been greatly stimulated by the contribu- 
tions of the three discussants, for which we are thank- 
ful. 